Abstract. Speech recognition system performance degrades in noisy environments. If the
acoustic models (HMMs) for speech are built using features of clean utterances, the features
of a noisy test utterance would be acoustically mismatched with the trained model. This
gives poor likelihood values and poor recognition accuracy. Model adaptation and feature
normalisation are two broad areas that address this problem. While the former often gives
better performance, the latter involves estimating fewer parameters, making
the system feasible for practical implementations.

This research focuses on the efficacies of various subspace, statistical and stereo based
feature normalisation techniques. A subspace projection based method has been investigated
as a standalone and adjunct technique involving reconstruction of noisy speech features
from a precomputed set of clean speech building-blocks. The building blocks are learned
using non-negative matrix factorisation (NMF) on log-Mel filter bank coefficients, which
form a basis for the clean speech subspace. The work provides a detailed study on how
the method can be incorporated into the extraction process of Mel-frequency cepstral
coefficients. Experimental results show that the new features are more robust to noise, and
achieve better results when combined with the existing techniques.

The work also proposes a modification to the training process of the popular SPLICE
algorithm for noise robust speech recognition. The modification is based on feature
correlations, and enables this stereo-based algorithm to improve the performance in all noise
conditions, especially in unseen cases. Further, the modified framework is extended to work
for non-stereo datasets, where clean and noisy training utterances are required but their
stereo counterparts are not. A computationally efficient MLLR-based run-time noise
adaptation method in the SPLICE framework has also been proposed.
CHAPTER 1


Introduction 


There are numerous applications of automatic speech recognition (ASR) in the present-day
world. From personal mobile assistant apps for the urban population to commodity price
helplines for farmers in rural areas, speech recognition has made its way from
the research laboratories to the common man. More important are the real-time interactive
voice response (IVR) based applications for information retrieval and storage, such as
automated primary health care, personalised farming, price retrieval, ticket booking and
so on.

Such applications, developed on low-end processors, require instant and robust speech-to-text
conversion under varying background noises and microphone distortions. Numerous
techniques have been proposed to improve the noise robustness of speech recognition.
These improved accuracies come at the cost of higher computational complexity. Low-end
processors have limited processing capability, and take a long time to perform
intense computational tasks. This delays the response of the system to the user, making it
annoying and often unusable. Thus only a few of the techniques are suitable for robust
real-time ASR systems, and there is an interest in understanding and improving them.

The typical back-end of an ASR system consists of a set of generative hidden Markov models
(HMMs) which are trained for each sound unit. Features extracted from the user's speech
data are compared against the HMMs using a process called Viterbi decoding to get the
most likely word sequence that could have been spoken. To overcome the effect of noise
and distortion, a class of techniques normalises the features of test speech so that they have
characteristics similar to those of the training speech.

This thesis concentrates on these feature normalisation techniques, which are suitable for
improving robustness in real-time applications. Two scenarios are considered, distinguished
by what is known during the training process of the back-end of the system, viz.,

(1) there is no information about noise

(2) some noisy training speech data are available

1.1. Motivation 

Examples of feature normalisation techniques are cepstral mean and variance normalisation
(CMVN), histogram equalisation (HEQ), stereo-based piecewise linear compensation
for environments (SPLICE) etc. While some of the techniques operate at utterance
level and estimate statistical parameters from the test utterances, others operate on
individual feature vectors. Parameter estimation in the former set of techniques requires
sufficiently long utterances for reliable estimation, and also some processing. The latter
are best suited for shorter utterances and real-time applications.

The standard features extracted from speech are the Mel-frequency cepstral
coefficients (MFCCs). In the final stage of their extraction process, a discrete cosine
transform (DCT)¹ is used for dimensionality reduction and feature decorrelation. An interesting
question to ponder at this juncture is whether DCT is the best method to achieve feature
dimensionality reduction and feature decorrelation along with noise robustness.

Heteroscedastic linear discriminant analysis (HLDA) [Kumar, 1997] is one technique
which achieves these qualities under low noise conditions, even though it was originally
designed for a different objective. The technique fails in high noise environments. This
thesis proposes using non-negative matrix factorisation (NMF) [Lee and Seung, 1999] and
reconstruction of the features prior to applying the DCT in the standard MFCC extraction
process. Two methods of achieving noise robust features are discussed in detail. These
techniques do not assume any information about noise during the training process.

It is also reasonable to identify the commonly encountered noise environments during
training, and learn their respective compensations. During the real-time operation of the
system, the background noise can be classified into one of the known kinds of noise so that
the corresponding compensation can be applied. An example of one such technique is the
stereo-based piecewise linear compensation for environments (SPLICE) [Deng et al., 2000]. This
technique requires stereo data during training, which consists of speech recorded
simultaneously using two microphones. One is a close-talk microphone capturing mostly clean
speech and the other is a far-held microphone, which captures noise along with speech.

This technique is of particular interest because it operates on individual feature vectors,
and the compensation is an easily implementable linear transformation of the feature.
However, there are two disadvantages of SPLICE. The algorithm fails when the test
noise condition is not seen during training. Also, owing to its requirement of stereo data
for training, the usage of the technique is quite restricted. This thesis proposes a modified
version of SPLICE that improves its performance in all noise conditions, predominantly
during severe noise mismatch. An extension of this modification is also proposed for datasets
that are not stereo recorded, with minimal performance degradation as compared to the
conventional SPLICE. To further boost the performance, run-time adaptation of the
parameters is proposed, which is computationally very efficient when compared to maximum
likelihood linear regression (MLLR), a standard model adaptation method.


1.2. Overview of the Thesis 

The rest of the thesis is organised as follows. Chapter 2 summarises the required
background and reviews some existing techniques in literature. Chapter 3 discusses the
proposed methods of performing feature compensation using NMF during MFCC extraction,
and assumes no information about noise during training. Chapter 4 details the proposed
modifications and techniques using SPLICE. Finally, Chapter 5 concludes the thesis,
indicating possible future extensions.

¹ DCT, by default hereafter, refers to Type-II DCT.








CHAPTER 2 


Background 


2.1. HMM-GMM based Speech Recognition 

The aim of a speech recognition system is to efficiently convert a speech signal into a
text transcription of spoken words [Rabiner and Schafer, 2010]. This requires extracting
relevant features from the large number of available speech samples, which is called feature
extraction. By this process, the speech signal is converted into a stream of feature vectors
capturing its time-varying spectral behaviour.

The feature vector streams of each basic sound unit are statistically modelled as HMMs
using a training process called the Baum-Welch algorithm. This requires a sufficient amount of
transcribed training speech. A dictionary specifying the conversion of words into the basic
sound units is necessary in this process. During testing, an identical feature extraction
process followed by Viterbi decoding is performed to obtain the text output. The decoding
is performed on a recognition network built using the HMMs, dictionary and a language
model. The language model assigns probabilities to sequences of words, and helps in boosting
the performance of the ASR system.

This process is summarised in Figure 2.1.1. A more detailed explanation can be found in
[Rabiner and Schafer, 2010]. The choice of the basic sound unit depends on the size of
the vocabulary. Word models are convenient to use for tasks such as digit recognition. For
large vocabulary continuous ASR, triphone models are built, which can be concatenated
appropriately during Viterbi decoding to represent words.


2.2. MFCC Feature Extraction 

Apart from the information about what has been spoken, a speech signal also contains
distinct characteristics which vary with the recording conditions such as background noise,
microphone distortion, reverberation and so on. Speaker dependent parameters such as
vocal tract structure, accent, mood and style of speaking also affect the signal. Thus an
ASR system requires a robust feature extraction process which captures only the speech
information, discarding the rest.


2.2.1. MFCCs. MFCCs have been shown to be one of the effective features for ASR.
The extraction of MFCCs is summarised in Figure 2.2.1 and involves the following steps:


(1) Short-time processing: Convert the speech signal into overlapping frames. On
each frame, apply pre-emphasis to give weightage to higher formants, apply a
Hamming window to minimise the signal truncation effects, and take the magnitude
of the DFT.

(2) Log Mel filterbank (LMFB): Apply a triangular filterbank with Mel-warping and
obtain a single coefficient output for each filter. Apply log operation on each
coefficient to reduce the dynamic range. These operations are motivated by the
acoustics of human hearing.

(3) DCT: Apply DCT on each frame of LMFB coefficients. This provides energy
compaction useful for dimensionality reduction, as well as approximate decorrelation
useful for diagonal covariance modelling.

Figure 2.1.1. Overview of speech recognition system

Figure 2.2.1. Extraction process of MFCCs

A cepstral lifter is generally used to give approximately equal weight to all the coefficients
in MFCCs (their amplitude diminishes due to the energy compaction property of DCT). Delta and
acceleration coefficients are also appended to the liftered MFCCs, to capture the dynamic
information in the speech signal. Finally, cepstral mean subtraction (CMS) is performed on
these composite features, to remove any stationary mismatch effects caused by recording
in different environments.
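As a rough illustration of steps (2) and (3) and of the liftering and CMS described above, the following NumPy sketch converts Mel filterbank outputs into liftered, mean-subtracted cepstra. The function names, the orthonormal DCT scaling, the sinusoidal lifter and its length of 22 are assumptions made for this example, not details taken from this thesis; delta and acceleration coefficients are omitted.

```python
import numpy as np

def dct2_matrix(n_out, n_in):
    """First n_out rows of a Type-II DCT matrix (orthonormal scaling, an assumption)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    D = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    D[0, :] /= np.sqrt(2.0)
    return D

def lmfb_to_mfcc(fbank, n_ceps=13, lifter=22):
    """fbank: (n_filters, n_frames) non-negative Mel filterbank outputs."""
    # Floor before the log so that the LMFB values are non-negative (assumed floor: 1.0).
    lmfb = np.log(np.maximum(fbank, 1.0))
    # Keep only the first n_ceps DCT coefficients (dimensionality reduction).
    ceps = dct2_matrix(n_ceps, lmfb.shape[0]) @ lmfb
    # Sinusoidal lifter (assumed form) to even out the coefficient amplitudes.
    k = np.arange(n_ceps)[:, None]
    ceps = ceps * (1.0 + 0.5 * lifter * np.sin(np.pi * k / lifter))
    # Cepstral mean subtraction over the frames of the utterance.
    return ceps - ceps.mean(axis=1, keepdims=True)
```

With 23 filters and the defaults above, each frame is reduced to 13 coefficients whose mean over the utterance is zero.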


2.3. Need for Additional Processing 

Let us look at the robustness aspects of MFCC composite features in response to two 
kinds of undesired variations, viz., speaker-based and environment-based. 

During MFCC extraction, most of the pitch information is discarded by the smoothing
operation of LMFB. Some speaker-specific characteristics are removed by truncating
the higher cepstral coefficients after DCT operation. However, other variations such as
those occurring due to differences in vocal-tract structures can be compensated for better
recognition results.

MFCC composite features are less robust to environment changes. The presence of
background noise, especially during testing, causes serious non-linear distortions which
cannot be compensated using CMS. This acoustic mismatch between the training and testing
environments needs additional compensation.

2.3.1. Feature Compensation and Model Adaptation. The techniques which operate
on features to nullify the undesired effects are called feature compensation techniques.
Examples are CMVN and HEQ. These methods are usually simple and can be incorporated
into real-time applications. So there is an interest in understanding and improving these
techniques.

In contrast, there is a class of model adaptation techniques which refine the
models to compensate the effects. These are usually computationally intense and yield high
recognition rates. Examples are maximum likelihood linear regression (MLLR) and speaker
adaptive training (SAT).

Some techniques such as joint factor analysis (JFA) and joint uncertainty decoding (JUD)
use a combination of both feature and model compensation to further improve the
recognition.

However, this thesis focuses on feature compensation done on a frame-by-frame basis,
due to its suitability for real-time applications.


2.4. A Brief Review of Some Techniques Used 

2.4.1. HLDA. Given features belonging to various classes, this technique aims at achieving
discrimination among the classes through dimensionality reduction. It linearly transforms
$D$ dimensional features such that only $R$ ($< D$) dimensions of the transformed
features have the discriminating capability, and the remaining $D - R$ dimensions can be
discarded. In ASR, the features assigned to each state of an HMM during training are
typically considered as belonging to a class. The transformation $\mathcal{H}$ is estimated from the
training data using class labels obtained from their first-pass transcriptions. The $R$
dimensional new features are then used to train and test the models.

Estimation of HLDA transform

Let the feature $y$ be transformed to obtain $x = \mathcal{H}y$. Let $y^{(k)}$ denote that the class-label
of $y$ is $k$. For each class $k$, the set $\{y^{(k)}\}$ are assumed to be Gaussian, and thus are their
corresponding $\{x^{(k)}\}$. It is desired that the last $D - R$ dimensions of $x$ do not contain any
discriminative information, i.e., the mean $\mu_k$ and covariance $\Sigma_k$ of all the $K$ classes are
identical in their last $D - R$ dimensions, as

\[
\mu_k = \begin{bmatrix} \mu_k^{(R)} \\ \mu_0 \end{bmatrix}, \qquad
\Sigma_k = \begin{bmatrix} \Sigma_k^{(R)} & 0 \\ 0 & \Sigma_0 \end{bmatrix}
\]

where $\mu_k^{(R)}$ and $\Sigma_k^{(R)}$ are the class-specific parts, and $\mu_0$ and $\Sigma_0$ are of
dimensions $(D - R) \times 1$ and $(D - R) \times (D - R)$ respectively.
If this is achieved, the last $D - R$ dimensions of $x^{(k)}$ can be discarded without loss of
discriminability.

The density of each $y^{(k)}$ is modelled as

\[
p\left(y^{(k)}\right) = \frac{|\mathcal{H}|}{(2\pi)^{D/2} |\Sigma_k|^{1/2}}
\exp\left( -\tfrac{1}{2} \left(\mathcal{H}y^{(k)} - \mu_k\right)^T \Sigma_k^{-1} \left(\mathcal{H}y^{(k)} - \mu_k\right) \right)
\]

Numerical solution for $\mathcal{H}$ can be derived by differentiating the log-likelihood function
$\mathcal{L} = \sum_{i=1}^{N} \log p\left(y_i^{(k)}\right)$ over the $N$ training vectors w.r.t. $\mathcal{H}$, and using the
maximum likelihood estimates of $\mu_k$ and $\Sigma_k$ obtained from $\mathcal{L}$ [Kumar, 1997].
2.4.2. HEQ. HEQ techniques are used to compensate for the acoustic mismatch between
the training and testing conditions of an ASR system, thereby giving improved performance.
HEQ is a feature compensation technique which defines a transformation that maps
the distribution of test speech features onto a reference distribution. As shown in Figure
2.4.1, a function $x = g(y)$ can be learned such that any noisy feature component $y_0$ can be
transformed to its corresponding equalised version $x_0$.



Figure 2.4.1. Illustration of histogram equalisation 


The reference distribution can be that of either clean speech, training data or even 
a parametric function. Both the training and test features need to be equalised to avoid 
mismatch. Since HEQ matches the whole distribution, it matches the means, variances and 
all other higher moments. 

Let $y$ be a component of a noisy speech feature vector modelled as $f_{Test}(y)$. The
relation between the cumulative distribution of $y$ and that of its equalised version $x$ is
given by

\begin{align*}
F_{Test}(y) &= \int_{-\infty}^{y} f_{Test}(y')\, dy' \\
&= \int_{-\infty}^{g(y)} f_{Test}\left(g^{-1}(x')\right) \frac{d g^{-1}(x')}{dx'}\, dx' \\
&= \int_{-\infty}^{g(y)} f_{Train}(x')\, dx' \\
&= F_{Train}(x), \qquad x = g(y)
\end{align*}

so that $g(y) = F_{Train}^{-1}\left(F_{Test}(y)\right)$. In practice, the function $g(y)$ is implemented
using quantile-based methods [Hilger and Ney, 2006].
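A minimal sketch of a quantile-based approximation of $g(y)$ is given below, assuming that samples of one feature component are available for both the test and the reference distributions. The function name, the number of quantiles and the piecewise-linear interpolation are illustrative assumptions, not the exact procedure of [Hilger and Ney, 2006].

```python
import numpy as np

def heq_quantile(y, ref, n_quantiles=100):
    """Equalise one feature component y (1-D) towards reference samples ref (1-D).

    Approximates x = F_ref^{-1}(F_test(y)) by matching empirical quantiles.
    """
    probs = np.linspace(0.0, 1.0, n_quantiles)
    q_test = np.quantile(y, probs)    # quantiles of the test distribution
    q_ref = np.quantile(ref, probs)   # the same quantiles of the reference distribution
    # Piecewise-linear map taking each test quantile to the matching reference quantile.
    return np.interp(y, q_test, q_ref)
```

For example, if `y` is a shifted and scaled copy of `ref`, the mapping undoes the shift and the scale, which is the behaviour expected from matching the whole distribution rather than only its mean and variance.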


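2.4.3. MLLR. MLLR is a widely used adaptation technique based on the maximum-likelihood
principle [Leggetter, 1995; Gales, 1998]. It performs adaptation through regression-based
transformations on (usually) the mean vectors $\mu_m$ of the system of HMMs, $m$ being the
mixture index. The transformations are estimated such that the original system is tuned to
a new speaker or environment:

\[
\hat{\mu}_m = \mathcal{A}_m \mu'_m
\]

where $\mathcal{A}_m$ is an MLLR transformation matrix of dimension $D \times (D + 1)$, and
$\mu'_m = [1 \;\; \mu_m^T]^T$. An estimate of $\mathcal{A}_m$ is obtained by maximising the likelihood
function of the adaptation data $y$

\[
\mathcal{L} = \frac{1}{(2\pi)^{D/2} |\Sigma_m|^{1/2}}
\exp\left( -\tfrac{1}{2} \left(y - \mathcal{A}_m \mu'_m\right)^T \Sigma_m^{-1} \left(y - \mathcal{A}_m \mu'_m\right) \right)
\]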

In this work, MLLR mean adaptation has been used as a global transform to adjust the 
system of HMMs to a new noise environment encountered during testing. The adaptation 
data are the same as test data, which consist of files recorded in a particular noise condition 
with sufficient speaker and speech variations. 
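The sketch below illustrates only how a global mean transform is applied once it has been estimated; the estimation itself (maximising the likelihood over the adaptation data) is not shown. The function name and the array layout are assumptions made for this example.

```python
import numpy as np

def mllr_adapt_means(means, A):
    """Apply a global MLLR mean transform.

    means: (M, D) array of Gaussian mean vectors, one row per mixture m.
    A:     (D, D + 1) transformation matrix; its first column acts as a bias.
    Returns the (M, D) adapted means A @ [1, mu_m^T]^T.
    """
    ones = np.ones((means.shape[0], 1))
    extended = np.hstack([ones, means])   # extended mean vectors, shape (M, D + 1)
    return extended @ A.T
```

With `A = [b | I]`, for instance, every mean is simply shifted by the bias vector `b`.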


2.4.4. SPLICE. SPLICE is a popular and efficient noise robust feature enhancement
technique. It partitions the noisy feature space into $M$ classes, and learns a linear
transformation based noise compensation for each partition class during training, using stereo
data. Any test vector $y$ is soft-assigned to one or more classes by computing $p(m|y)$
($m = 1, 2, \ldots, M$), and is compensated by applying the weighted combination of linear
transformations to get the cleaned version $\hat{x}$.

\[
\hat{x} = \sum_{m=1}^{M} p(m|y) \left(A_m y + b_m\right) \tag{2.4.1}
\]

$A_m$ and $b_m$ are estimated during training using stereo data. The training noisy vectors $\{y\}$
are modelled using a Gaussian mixture model (GMM) $p(y)$ of $M$ mixtures, and $p(m|y)$ is
calculated for a test vector as a set of posterior probabilities w.r.t. the GMM $p(y)$. Thus the
partition class is decided by the mixture assignments $p(m|y)$. This is illustrated in Figure
2.4.2.

Figure 2.4.2. Illustration of SPLICE feature enhancement
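The compensation in equation (2.4.1) can be sketched as follows, assuming that the GMM and the per-class transforms have already been trained. The diagonal covariances, the function names and the array shapes are assumptions made for this illustration; the training of $A_m$ and $b_m$ from stereo data is not shown.

```python
import numpy as np

def gmm_posteriors(y, weights, means, variances):
    """p(m | y) for a diagonal-covariance GMM; y: (D,), means and variances: (M, D)."""
    log_lik = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                      + np.sum((y - means) ** 2 / variances, axis=1))
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()            # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def splice_compensate(y, weights, means, variances, A, b):
    """x_hat = sum_m p(m|y) (A_m y + b_m), with A: (M, D, D) and b: (M, D)."""
    post = gmm_posteriors(y, weights, means, variances)
    per_class = A @ y + b                 # (M, D): A_m y + b_m for every class m
    return post @ per_class               # posterior-weighted combination
```

Because each frame is processed independently with a few matrix-vector products, the per-frame cost is small, which is in line with the interest in SPLICE for real-time use.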


2.4.5. NMF. NMF is an approximate matrix decomposition

\[
\underset{(D \times N)}{V} \approx \underset{(D \times R)}{W}\; \underset{(R \times N)}{H} \tag{2.4.2}
\]

where V, consisting of non-negative elements, is decomposed into two non-negative matrices
W and H. In the context of speech data, the columns of V constitute non-negative feature
vectors $\{v_n\} \in \mathbb{R}_+^{D}$, $n = 1, 2, \ldots, N$. After the decomposition, W is a non-negative
dictionary of basis vectors $\{w_r\} \in \mathbb{R}_+^{D}$, $r = 1, 2, \ldots, R$ along columns, representing a
useful subspace in which $\{v_n\}$ are contained, when $R < D$. $H \in \mathbb{R}_+^{R \times N}$ is a matrix of
vectors $\{h_n\} \in \mathbb{R}_+^{R}$, $n = 1, 2, \ldots, N$, such that each $h_n = \{h_{1n}, h_{2n}, \ldots, h_{Rn}\}$ is a set of
weights or activation coefficients acting upon all the bases $\{w_r\}$ to give the corresponding
$v_n$, independently of all the other columns, as shown in equation (2.4.3).

\[
v_n \approx \sum_{r=1}^{R} w_r h_{rn} \tag{2.4.3}
\]

The decomposition (2.4.2) can be learned by randomly initialising W and H. The
estimates of W and H are iteratively improved by minimising a KL-divergence based cost
function

\[
\mathcal{D}(V \,\|\, WH) = \sum_{d,n} \left[ V \otimes \log \frac{V}{WH} - V + WH \right]_{dn} \tag{2.4.4}
\]

where $\otimes$ denotes Hadamard product, the division inside the log is element-wise and the
summation is over all the elements of the matrix. The optimisation

\[
\underset{W, H}{\arg\min}\; \mathcal{D}(V \,\|\, WH), \qquad W \geq 0,\; H \geq 0
\]

gives the following iterative update rules [Lee and Seung, 1999; Lee and Seung, 2000]
for refining the matrices W and H:

\[
w_{dr} := w_{dr}\, \frac{\sum_{n} h_{rn} v_{dn} / [WH]_{dn}}{\sum_{n} h_{rn}} \tag{2.4.5}
\]

\[
h_{rn} := h_{rn}\, \frac{\sum_{d} w_{dr} v_{dn} / [WH]_{dn}}{\sum_{d} w_{dr}} \tag{2.4.6}
\]

where := refers to the assignment operator. It can be seen that the update rules are
multiplicative, i.e., the matrices are updated by performing just a product with another matrix,
making the implementation simple and quick. The columns of W can be thought of as
basic building blocks that can reconstruct speech features.
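The update rules (2.4.5) and (2.4.6) translate almost directly into matrix operations. The following NumPy sketch is one possible implementation; the function names, the uniform random initialisation, the iteration count and the small constant `eps` that guards against division by zero are assumptions made for this illustration.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Cost (2.4.4), summed over all elements of the matrix."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

def nmf_kl(V, R, n_iter=200, seed=0, eps=1e-12):
    """KL-divergence NMF of a non-negative (D, N) matrix V into W (D, R) and H (R, N)."""
    rng = np.random.default_rng(seed)
    D, N = V.shape
    W = rng.uniform(0.1, 1.0, (D, R))     # random non-negative initialisation
    H = rng.uniform(0.1, 1.0, (R, N))
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)
        W *= (ratio @ H.T) / (H.sum(axis=1) + eps)              # update (2.4.5)
        ratio = V / (W @ H + eps)
        H *= (W.T @ ratio) / (W.sum(axis=0)[:, None] + eps)     # update (2.4.6)
    return W, H
```

Since every factor in the updates is non-negative, W and H stay non-negative without any explicit projection step.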

2.5. Recent Work in Literature - Motivation 


[Sainath et al., 2012] showed an overview of a wide range of techniques in ASR that
use speech exemplars (dictionaries). It was argued that noise robustness can be achieved
through the use of speech dictionaries. However, most of these techniques are
computationally intense.

Since NMF is one of the methods of learning dictionaries and is easily implementable, 
it is of particular interest. NMF is known to learn useful time-frequency patterns in a 
given dataset, and has been applied to learn spectral representations in audio and speech 
applications that include audio source separation, supervised speech separation, speech 
enhancement and recognition. A few of them are mentioned below. 


A regularised variant of NMF was used by [Wilson et al., 2008] to learn separate
dictionaries of speech and noise. The concatenated dictionary was used in NMF to learn
weights of noisy test utterances, where the weights corresponding to noise bases are
suppressed to achieve speech denoising and thus enhancement. [Schuller et al., 2010]
proposed supervised NMF for improving noise robustness in a spelling recognition task. NMF
was performed using a predetermined W consisting of spectra of spelled letters. The
authors showed that appending the weights H to MFCC and BLSTM (Bidirectional Long
Short-Term Memory neural network) features improves noise robustness. [Gemmeke
et al., 2011] represented LMFB features of noisy speech using exemplars of speech and
noise bases. A hybrid (exemplar and HMM based) recognition on the Aurora-2 task was
performed to achieve noise robustness at high noise levels.

Most of the above techniques have used the weights H as new features, or combined
them with existing features, for improved recognition. In contrast, Chapter 3 of this thesis
will show that multiplying back W and H to get new LMFB features and converting them
to MFCCs improves noise robustness. This approach is useful in real-time applications
because of its fast implementation. A technique which learns a robust W will also be
discussed. These methods do not assume any information about noise during training.

When there are noisy training files available, techniques such as SPLICE learn
compensation from the seen noisy data to obtain their corresponding clean versions. Over the
last decade, improvements using uncertainty decoding [Droppo et al., 2002], maximum
mutual information based training [Droppo and Acero, 200^], speaker normalisation
[Shinohara et al., 2008] etc. were introduced in the SPLICE framework. There are two
disadvantages of SPLICE. The algorithm fails when the test noise condition is not seen during
training. Also, owing to its requirement of stereo data for training, the usage of the
technique is quite restricted. So there is an interest in addressing these issues.

[Chijiiwa et al., 2012] recently proposed an adaptation framework using Eigen-SPLICE
to address the problem of unseen noise conditions. The method involves preparation of
quasi stereo data using the noise frames extracted from non-speech portions of the test
utterances. For this, the recognition system is required to have access to some clean training
utterances for performing run-time adaptation.

[Gonzalez et al., 2011] proposed a stereo-based feature compensation method, which
is similar to SPLICE in certain aspects. Clean and noisy feature spaces were partitioned into
vector quantised (VQ) regions. The stereo vector pairs belonging to the $i$th VQ region in clean
space and the $j$th VQ region in noisy space are classified to the $ij$th sub-region. Transformations
based on a Gaussian whitening expression were estimated from every noisy sub-region to
the corresponding clean sub-region. But it is not always guaranteed that there is enough data
to estimate a full transformation matrix from each sub-region to the other.

In Chapter 4, a simple modification to SPLICE will be proposed, based on an assumption
made on the correlation of training stereo data. This will be shown to give improved
performance in all the noise conditions, predominantly in unseen conditions which are
highly mismatched with those of training. An extension of the method to non-stereo datasets
(which are not stereo recorded) will be proposed, with minimal performance degradation
as compared to conventional SPLICE. Finally, an MLLR-based run-time noise adaptation
framework will be proposed, which is computationally efficient and achieves better
results than MLLR-based model adaptation.






CHAPTER 3 


Non-Negative Subspace Projection 

3.1. Introduction 

During MFCC extraction, usually 23 LMFB coefficients are converted into 13 dimensional
MFCCs through DCT. Results in Table show that performing HLDA on 39 dimensional
MFCCs gives better robustness to noise than MFCCs when the noise levels are low.
But this technique fails in high noise conditions. However, the objective of HLDA is not to
achieve robustness, but class-separability. Here a method is proposed which aims at finding
representations of speech feature vectors using building-blocks.

The method operates on LMFB feature vectors by representing them using non-negative
linear combinations of their non-negative building blocks. DCT-II is applied on the new
feature vectors to obtain new MFCCs. This incorporates, into the MFCC extraction process,
a concept that the speech features are made up of underlying building blocks. Apart
from the proposed additional step, conventional MFCC extraction framework is maintained
throughout the process. The building blocks of speech are learned using NMF on speech
feature vectors. Experimental results show that the new MFCC features are more robust
to noise, and achieve better results when combined with the existing noise robust feature
normalisation techniques like HEQ and HLDA.



Figure 3.1.1. NMF subspace projection (legend: basis vector, clean feature vector, noisy feature vector, reconstructed vector)
3.2. The Speech Subspace

The columns of W can be thought of as basic building blocks that construct all the
speech feature vectors, or the bases for the subspace of the speech feature vectors. So far, no
mathematical proof has been derived to validate that NMF learns the representations of the
underlying data. However, much of the literature supports this. [Smaragdis, 200^] states
that the basis functions describe the spectral characteristics of the input signal components,
and reveal its vertical structure. [Wilson et al., 2008] refer to these building blocks as a
set of spectral shapes. [Schuller et al., 2010] refer to them as spectra of events occurring
in the signal, and NMF is known to learn useful time-frequency patterns in a given dataset.

One could probably use other basis learning techniques like principal component analysis
(PCA) to estimate the clean subspace. Though PCA finds the directions of maximum
spread of the features, these directions need not contain only speech information. In methods
such as NMF, new features can be reconstructed within the subspace using different
cost functions to achieve desired qualities such as source localisation, sparseness etc.
KL-divergence based NMF has been successfully applied in speech applications [Wilson et al.,
2008; Schuller et al., 2010] in preference to the conventional Euclidean distance measure. In
addition, the non-negativity constraint is an added advantage. Figure 3.1.1 shows the bases
$w_1$ and $w_2$ learned by NMF on two-dimensional data contained in the subspace shown.
Any vector outside this subspace cannot be reconstructed perfectly using $w_1$ and $w_2$ when
the weights acting on them are constrained to be non-negative. So when these features
are moved away from the subspace due to the effect of noise, the reconstructed features
can be used in place of these features as better representations of the underlying signal.
However, vectors still inside the subspace are not compensated.

3.3. Learning the Speech Subspace 

NMF can be performed by optimising cost functions based on different measures such
as Euclidean, KL-divergence and Itakura-Saito distances. Here the KL-divergence based method is
chosen, which gives the update equations (2.4.5) and (2.4.6). The decomposition is not
unique since the cost function (2.4.4) to be optimised is not convex. Depending on the
application, one may choose to perform update on both W and H in each iteration, or fix
one matrix and update the other. While simultaneously refining W and H, the columns of
W may be normalised after each update so that each column sums to
1. The scaling is automatically compensated in the weight matrix H during its update.
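The effect of column normalisation can be seen in the short sketch below. Note that, unlike the procedure described above, where the next update of H absorbs the scaling, this illustration rescales H explicitly so that the product W H is visibly unchanged; that explicit rescaling and the function name are assumptions made for the example.

```python
import numpy as np

def normalise_columns(W, H):
    """Scale each column of W to sum to 1 and rescale the rows of H to keep W @ H fixed."""
    scale = W.sum(axis=0)                 # one positive factor per basis vector
    return W / scale, H * scale[:, None]
```

The normalisation therefore changes only how the scale is shared between the dictionary and the weights, not the reconstruction itself.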

Fig. 3.2.1a shows the plot of LMFB outputs of an utterance containing connected
spoken digits. Using the dictionary learned from the Aurora-2 database (as described in
Section 3.4), the reconstructions of the same utterance from each individual basis vector of
W are plotted in Fig. 3.2.1b. It can be observed that each of the reconstructions has only a
set of particular dominant frequencies that are captured by the corresponding basis vector.
Such a dictionary can capture useful combinations of frequencies present in speech
utterances, and can give noise-robust speech reconstructions. The reconstructed feature vectors
are more correlated, due to their confinement to the speech subspace, than the original
ones.
Figure 3.2.1. LMFB features of a speech utterance: (a) original features; (b) reconstruction using NMF individual basis vectors. Horizontal axis: time (seconds).


In the PCA method, the speech dictionary W can be built by stacking the most significant
eigenvectors of the clean training data as columns. Any speech feature $v$ can be
reconstructed as

\[
\hat{v} = W W^T v
\]
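A small sketch of this PCA alternative follows. Whether the eigenvectors are taken from the covariance or from the uncentred second-moment matrix is not stated above; the uncentred form is assumed here because the reconstruction formula involves no mean subtraction. The function names are likewise illustrative.

```python
import numpy as np

def pca_dictionary(V_train, R):
    """Return a (D, R) matrix whose columns are the R most significant eigenvectors."""
    # Uncentred second-moment matrix of the training vectors (an assumption, see above).
    S = V_train @ V_train.T / V_train.shape[1]
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues are returned in ascending order
    return eigvecs[:, ::-1][:, :R]

def pca_reconstruct(W, v):
    """v_hat = W W^T v, the orthogonal projection of v onto the span of the columns of W."""
    return W @ (W.T @ v)
```

Unlike the NMF reconstruction, the weights `W.T @ v` may be negative, so the projection is onto the whole linear subspace rather than onto non-negative combinations of the bases.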


3.4. Proposed Feature Extraction Methods 

The speech signal is passed through conventional short-time processing steps followed by
an LMFB. Before applying DCT, which corresponds to obtaining the conventional MFCC
features, an additional step is proposed to be introduced as shown in Fig. 3.4.1 so that the
new MFCCs obtained after the processing are more robust to noise. The additional step
is computed in two different methods as described in Sections 3.4.1 and 3.4.2 and the
performances are compared in Sections 4.5.2

Figure 3.4.1. Computation of subspace projected MFCCs


Short-time processing of speech signal includes applying STFT with Hamming window
of length 25 ms at a frame rate of 100 frames/second, followed by passing the mean
subtracted frames through a pre-emphasis filter $1 - 0.97z^{-1}$. A conventional Mel-filter bank of
23 constant bandwidth equal gain linear triangular filters on the Mel scale is applied to get
a set of filter outputs for each frame. These outputs are Mel-floored to value 1.0, and log
operator is applied to obtain non-negative LMFB feature vectors $\{v_n\}$.

Let the LMFB feature vectors {v^} and {v;^} corresponding to the training and test 
speech be stacked as columns of and V<^ respectively. 


3.4.1. NMF_plain. $V_{tr}$ is decomposed as $V_{tr} \approx W_{tr} H_{tr}$ using NMF decomposition by simultaneously updating $W_{tr}$ and $H_{tr}$ by (2.4.5) and (2.4.6). Each column of $W_{tr}$ is normalised after every iteration. Finally, $W_{tr}$ learns the building blocks of these feature vectors. During testing, $V_{te}$ is approximated as the product of $W_{tr}$ and $H_{te}$ (i.e., $V_{te} \approx W_{tr} H_{te}$). This is done by fixing $W_{tr}$ obtained during training and performing the update on $H_{te}$ alone using (2.4.6). The new feature vectors in training and testing are thus

(3.4.1)  $\hat{V}_{tr} = W_{tr} H_{tr}$

and

(3.4.2)  $\hat{V}_{te} = W_{tr} H_{te}$

respectively. In the implementation, the whole training data has been taken as $V_{tr}$ to compute $W_{tr}$ using NMF. However, during testing, each test utterance can separately be taken as $V_{te}$ and the corresponding $H_{te}$ can be computed, fixing $W_{tr}$.

Here the assumption is that many of the columns of $V_{te}$ are initially outside the speech subspace due to the effect of noise. If each of these vectors outside the subspace is mapped to the nearest vector within the subspace, the new features are more noise-robust. If noise moves a clean feature vector to another vector in the same subspace, it cannot be compensated. Here the subspace is captured by $W_{tr}$, and the term nearest is meant in the sense of the KL-divergence distance. Replacement of $V_{te}$ by $\hat{V}_{te}$ can alternatively be justified as follows. Each $v_{te}$ is represented by non-negative linear combinations of building blocks of speech data, because of which any noise component in $v_{te}$ cannot be reconstructed using speech bases, and hence cannot be retained in $\hat{v}_{te}$.
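The training and testing stages of NMF_plain can be sketched as follows. This is a minimal illustration under stated assumptions: the update rules (2.4.5) and (2.4.6) are taken to be the standard multiplicative updates for the generalised KL divergence, the two factors are updated alternately within each iteration, and the columns of the dictionary are normalised to unit sum with the activations rescaled to compensate. All function names and the random toy data are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)

def kl_divergence(V, V_hat):
    """Generalised KL divergence D(V || V_hat), summed over all entries."""
    V_hat = V_hat + EPS
    return float(np.sum(V * np.log((V + EPS) / V_hat) - V + V_hat))

def update_H(V, W, H):
    """Multiplicative KL update of the activations with W held fixed."""
    ratio = V / (W @ H + EPS)
    return H * (W.T @ ratio) / (W.sum(axis=0)[:, None] + EPS)

def update_W(V, W, H):
    """Multiplicative KL update of the dictionary with H held fixed."""
    ratio = V / (W @ H + EPS)
    return W * (ratio @ H.T) / (H.sum(axis=1)[None, :] + EPS)

def learn_dictionary(V_tr, R, n_iter=100, seed=0):
    """Training: factorise V_tr (D x N) as W_tr (D x R) times H_tr (R x N)."""
    rng = np.random.default_rng(seed)
    W = V_tr[:, rng.choice(V_tr.shape[1], size=R, replace=False)] + EPS  # random columns of V
    H = rng.random((R, V_tr.shape[1]))
    for _ in range(n_iter):
        W = update_W(V_tr, W, H)
        H = update_H(V_tr, W, H)
        scale = W.sum(axis=0) + EPS            # normalise each column of W ...
        W, H = W / scale, H * scale[:, None]   # ... and rescale H so that W @ H is unchanged
    return W, H

def project(V_te, W_tr, n_iter=100, seed=0):
    """Testing: keep W_tr fixed, update only H_te, and return (W_tr @ H_te, H_te)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W_tr.shape[1], V_te.shape[1]))
    for _ in range(n_iter):
        H = update_H(V_te, W_tr, H)
    return W_tr @ H, H

rng = np.random.default_rng(1)
V_tr = rng.random((23, 200)) + 0.1   # toy non-negative training frames (columns)
V_te = rng.random((23, 40)) + 0.1    # toy test utterance
W_tr, H_tr = learn_dictionary(V_tr, R=20)
V_te_hat, H_te = project(V_te, W_tr)
```

Since only non-negative combinations of the learned columns are allowed, the reconstruction `V_te_hat` stays inside the cone spanned by the dictionary, which is the sense in which components outside the speech subspace are discarded.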

Now, by convention, the DCT matrix $D$, followed by a standard cepstral lifter $L$, is applied on the features $\hat{V}_{tr}$ and $\hat{V}_{te}$ to obtain 13 dimensional MFCC features ($C_0 \ldots C_{12}$) for each frame, given by Eqs. (3.4.3) and (3.4.4).

(3.4.3)  $C_{tr} = L D \hat{V}_{tr}$

(3.4.4)  $C_{te} = L D \hat{V}_{te}$

These features are used to parameterise the acoustic models (HMMs), as explained in Section 3.6.1.

Figure 3.4.2. Learning the speech dictionary in NMF_robustW
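The mapping from projected LMFB features to MFCCs is a pair of matrix products, sketched below. The text does not state the exact constants, so the DCT-II scaling and the sinusoidal lifter with parameter 22 used here are assumptions (they are widely used conventions); the function names and toy data are hypothetical.

```python
import numpy as np

def dct_matrix(n_ceps, n_filters):
    """DCT-II matrix of shape (n_ceps, n_filters); row 0 yields C0."""
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n_filters)[None, :]
    return np.sqrt(2.0 / n_filters) * np.cos(np.pi * k * (j + 0.5) / n_filters)

def lifter_matrix(n_ceps, Q=22):
    """Diagonal sinusoidal lifter; Q = 22 is an assumed, commonly used value."""
    k = np.arange(n_ceps)
    return np.diag(1.0 + (Q / 2.0) * np.sin(np.pi * k / Q))

D = dct_matrix(13, 23)                               # 13 cepstra from 23 filter outputs
L = lifter_matrix(13)
V_hat = np.random.default_rng(0).random((23, 50))    # toy projected LMFB frames (columns)
C = L @ D @ V_hat                                    # 13-dimensional MFCCs, one column per frame
```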

3.4.2. NMF_robustW. Each set of basis activation coefficients $h_n$ (for both training and test data) is unique for the corresponding speech frame. An addition of noise changes the statistics of $h_n$. So it is intuitive that equalising the statistics of $H_{te}$ during testing, to match those of the training $H_{tr}$, improves the recognition. The equalisation has to be applied during training also, to avoid mismatch of the test features against the built acoustic model. Here the intention is not to perform test feature equalisation explicitly, but to get a dictionary $\hat{W}_{tr}$ which helps learn test weights that are in equalised form, and thus are more robust. Figure 3.4.2 shows a method of obtaining the better dictionary $\hat{W}_{tr}$ from $W_{tr}$, $H_{tr}$ and $V_{tr}$, using HEQ during training.

After decomposing $V_{tr}$ as $W_{tr} H_{tr}$ using NMF, a set of $R$ reference histograms $\{Q_r\}$ of $H_{tr}$ are calculated, one for each of its feature components. HEQ is performed on $H_{tr}$, as described in Section 3.5, to get better or more robust activations $\hat{H}_{tr}$. But $W_{tr} \hat{H}_{tr}$ no longer matches $V_{tr}$. So a better $W$ is estimated using the NMF update (2.4.5), i.e., by fixing $\hat{H}_{tr}$, $V_{tr}$ and updating $W$. This essentially solves the optimisation

$\hat{W}_{tr} = \arg\min_{W} D(V_{tr} \,\|\, W \hat{H}_{tr}), \quad W \geq 0, \; \hat{H}_{tr} \geq 0$

In other words, the speech dictionary is chosen such that the weights become statistically equalised during training. Experimental results show that $\hat{W}_{tr}$ is a better dictionary than $W_{tr}$ in terms of noise-robustness. The training features are thus

(3.4.5)  $\hat{V}_{tr} = \hat{W}_{tr} \hat{H}_{tr}$

The testing process is the same as described in Section 3.4.1, except that the new $\hat{W}_{tr}$ is directly used to estimate the test data weights $H_{te}$, instead of $W_{tr}$. Additional equalisation is not performed during testing, since the dictionary itself helps in learning equalised weights. The test features are thus

(3.4.6)  $\hat{V}_{te} = \hat{W}_{tr} H_{te}$
As per convention, the DCT and cepstral lifter matrices are applied on $\hat{V}_{tr}$ and $\hat{V}_{te}$, as given by Eqs. (3.4.3) and (3.4.4), to obtain the 13 dimensional MFCC feature vectors for training and testing the HMMs. Here it can be seen that the computational costs of NMF_plain and NMF_robustW are the same during testing.
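The dictionary re-estimation step of NMF_robustW can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the equalisation is a simple rank-based mapping applied utterance by utterance against the pooled training activations, the dictionary update is the standard multiplicative rule for the generalised KL divergence, and random matrices stand in for the output of a plain NMF. All names and the utterance boundaries are hypothetical.

```python
import numpy as np

EPS = 1e-12

def kl_divergence(V, V_hat):
    """Generalised KL divergence D(V || V_hat), summed over all entries."""
    V_hat = V_hat + EPS
    return float(np.sum(V * np.log((V + EPS) / V_hat) - V + V_hat))

def equalise(h, ref_sorted):
    """Map the values in h so that their empirical CDF follows the reference sample."""
    ranks = np.argsort(np.argsort(h))                       # rank of each value, 0..n-1
    cdf = (ranks + 0.5) / h.size                            # empirical CDF value in (0, 1)
    ref_cdf = (np.arange(ref_sorted.size) + 0.5) / ref_sorted.size
    return np.interp(cdf, ref_cdf, ref_sorted)              # inverse of the reference CDF

def equalise_activations(H_tr, utt_bounds):
    """HEQ of every row of H_tr, utterance by utterance, against the pooled training row."""
    H_hat = np.empty_like(H_tr)
    for r in range(H_tr.shape[0]):
        ref_sorted = np.sort(H_tr[r])                       # reference for component r
        for start, stop in utt_bounds:
            H_hat[r, start:stop] = equalise(H_tr[r, start:stop], ref_sorted)
    return H_hat

def reestimate_dictionary(V_tr, H_hat, W_init, n_iter=100):
    """Fix H_hat and V_tr; update only W with the multiplicative KL rule."""
    W = W_init.copy()
    for _ in range(n_iter):
        ratio = V_tr / (W @ H_hat + EPS)
        W = W * (ratio @ H_hat.T) / (H_hat.sum(axis=1)[None, :] + EPS)
    return W

rng = np.random.default_rng(0)
V_tr = rng.random((23, 120)) + 0.1            # toy training frames
W_tr = rng.random((23, 20)) + 0.1             # stand-in for a dictionary from plain NMF
H_tr = rng.random((20, 120))                  # stand-in for its activations
utt_bounds = [(0, 40), (40, 90), (90, 120)]   # hypothetical utterance boundaries
H_hat = equalise_activations(H_tr, utt_bounds)
W_hat = reestimate_dictionary(V_tr, H_hat, W_tr)
V_tr_hat = W_hat @ H_hat                      # training features built from the new dictionary
```

At test time only the activations are updated against `W_hat`, so the per-utterance cost matches that of NMF_plain.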

3.5. Cascading with Existing Techniques 

The proposed methods can be cascaded with HEQ, where the reference histogram can be built using $C_{tr}$, and HEQ can be applied on $C_{tr}$ and $C_{te}$ to get new feature vectors, using which acoustic models can be built as described in Section 3.6.1.

Since the feature vectors given by Eqs. (3.4.1) and (3.4.2) are constrained to the column space of $W_{tr}$, there is certain additional correlation introduced in them. Their corresponding cepstral features $C_{tr}$ and $C_{te}$ are made approximately decorrelated through the use of the DCT. However, an additional decorrelation using HLDA is expected to further reduce the feature correlations, besides utilising the advantage of subspace projection. This also makes them more suitable for diagonal covariance modelling. The HLDA transformation matrix is estimated in the maximum likelihood (ML) framework after building the acoustic models using 39 dimensional MFCCs as described in [Kumar and Andreou, 1998], and is applied to both the feature vectors and the models in the conventional method.

3.6. Experiments and Results 


3.6.1. Experimental Setup. The Aurora-2 task [Hirsch and Pearce, 2000] has been used to perform a comparative study of the proposed techniques versus the existing ones. Aurora-2 consists of connected spoken digit utterances of the TIDigits database, filtered and resampled to 8 kHz, and with noises added at different SNRs. The noises are sounds recorded in places such as a train station, a crowd of people, a restaurant, the interior of a car, etc. The availability of both clean and noisy versions of the training speech utterances makes them stereo in nature. The test set consists of 10 sets of utterances, each with one noise environment added, and each at seven distinct SNR levels.

For all the experiments, $C_0$-inclusive MFCC vectors of 13 dimensions, obtained from the signal processing blocks, are appended with 13 delta and 13 acceleration coefficients to get a composite 39 dimensional vector per frame. Cepstral mean subtraction (CMS) has been performed on these vectors, and the resultant feature vectors are used for building the acoustic word model for each digit, which in the Aurora-2 task is a left-to-right continuous density HMM with 16 states and 3 diagonal covariance Gaussian mixtures per state. HMM Toolkit (HTK) 3.4.1 [Young et al., 2009] has been used for building and testing the acoustic models.

NMF has been implemented using MATLAB software. The size of the speech dictionary (in other words, the dimensionality of the speech subspace) has been optimised and chosen as $R = 20$ throughout the experiments based on recognition results. It is to be noted that during NMF, LMFB feature vectors are of dimension $D = 23$. While performing NMF, $W$ has been initialised from random columns of $V$, and $H$ with random numbers in [0, 1]. 500 iterations of NMF are performed in all the experiments.

Figure 3.6.1. Results of PCA as a subspace projection step (horizontal axis: size of the dictionary)

For performing HEQ, a quantile-based method has been employed, dividing the range of cdf values into 100 quantiles. In the experiments including HLDA, all 39 directions are retained, since the aim is to observe only the performance improvement after nullifying the correlations.
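A quantile-based HEQ of the kind mentioned above can be sketched as follows. This is a minimal illustration, assuming that the reference quantiles are computed per feature component over the training data and that each utterance is mapped by piecewise-linear interpolation between its own quantiles and the reference quantiles; the function names and the Gaussian toy data are hypothetical.

```python
import numpy as np

N_QUANTILES = 100   # number of divisions of the CDF range

def reference_quantiles(train_values):
    """Quantiles of one feature component over the training data (the reference CDF)."""
    levels = np.linspace(0.0, 1.0, N_QUANTILES + 1)
    return np.quantile(train_values, levels)

def heq_component(utt_values, ref_q):
    """Equalise one component of one utterance to the reference distribution."""
    levels = np.linspace(0.0, 1.0, ref_q.size)
    utt_q = np.quantile(utt_values, levels)        # quantiles of the utterance itself
    return np.interp(utt_values, utt_q, ref_q)     # piecewise-linear map utt_q -> ref_q

def heq(utt_feats, ref_q_all):
    """Apply HEQ independently to each row (feature component) of a (D, T) utterance."""
    return np.vstack([heq_component(row, q) for row, q in zip(utt_feats, ref_q_all)])

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(13, 20000))          # toy training features
noisy_utt = 3.0 * rng.normal(size=(13, 2000)) + 5.0     # toy utterance with mismatched statistics
ref_q_all = np.vstack([reference_quantiles(row) for row in train])
equalised = heq(noisy_utt, ref_q_all)
```

The mapping is monotonic within each component, so the ordering of frames is preserved while the histogram is pulled towards the reference. The quality of `utt_q` depends on the number of frames in the utterance.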


3.6.2. Results. Table 1 shows the recognition accuracies of the various techniques on the Aurora-2 database. Average values shown are taken over SNR levels 20 to 0 dB. Table 1a shows the accuracies of the proposed methods. Tables 1b and 1c show how the methods improve when cascaded with HLDA and HEQ respectively. It can be observed that HLDA can be used to achieve noise robustness at low noise levels only, and fails at high noise levels (SNR 5 to -5 dB). In the PCA method, as shown in Figure 3.6.1, a dictionary of size 23 is an orthogonal matrix which retains all the directions, thus corresponding exactly to the baseline system. It can be seen that this method hardly gives an improvement over the baseline at different dictionary sizes. In terms of absolute improvement in average recognition accuracy over the baseline, NMF_plain gives 1.86%, NMF_plain cascaded with HLDA gives 5.09%, NMF_robustW gives 5.96%, NMF_robustW+HLDA gives 8.6%, NMF_plain+HEQ gives 13.69% and NMF_robustW+HEQ gives 13.48%.

Specifically at SNR 5 dB, the individual proposed methods NMF_plain and NMF_robustW give absolute improvements of 4.61% and 13.61% respectively. When combined with HEQ, the highest improvement achieved is 27.3% at SNR 5 and 0 dB. Figure 3.6.1 shows the accuracy of using PCA as a feature processing step at different sizes of the dictionary.

The combination NMF_plain+HEQ+HLDA achieved a recognition accuracy of 81.26%. This is not a worthwhile improvement over the other proposed techniques, considering the increased computational complexity.
Table 1. Subspace projection on Aurora-2 database

(a) Individual methods

Noise Level   Baseline   NMF_plain   NMF_robustW
Clean         99.25      99.27       99.02
SNR 20        97.35      97.50       97.45
SNR 15        93.43      94.17       95.00
SNR 10        80.62      82.73       86.94
SNR 5         51.87      56.48       65.48
SNR 0         24.30      25.97       32.46
SNR -5        12.03      12.46       12.68
Average       69.51      71.37       75.47


(b) Cascading with HLDA

Noise Level   HLDA    NMF_plain + HLDA   NMF_robustW + HLDA
Clean         99.35   99.35              99.28
SNR 20        98.11   98.25              98.26
SNR 15        94.84   95.69              96.20
SNR 10        82.08   87.14              89.76
SNR 5         49.85   63.63              70.95
SNR 0         21.64   28.31              35.41
SNR -5        11.15   11.47              12.97
Average       69.30   74.60              78.11


(c) Cascading with HEQ

Noise Level   HEQ     NMF_plain + HEQ   NMF_robustW + HEQ
Clean         99.03   98.94             98.73
SNR 20        97.68   97.76             97.21
SNR 15        95.39   95.93             95.26
SNR 10        90.00   91.49             90.67
SNR 5         75.61   79.21             78.79
SNR 0         46.66   51.63             53.04
SNR -5        18.37   19.54             21.28
Average       81.07   83.20             82.99


3.7. Discussion 


Methods described in Sections 3.4.1 and 3.4.2 are observed to give improvements in all noise conditions at all SNR levels, and give significantly better performances in moderate to high noise conditions. NMF_robustW+HEQ does not give an improvement over NMF_plain+HEQ; both of them perform almost equally well. So when almost equalised weights are obtained through NMF_robustW, there is no advantage when an additional equalisation is done in the cepstral domain.

The proposed methods have notable advantages over other techniques in literature. There is no need of building a separate dictionary for capturing noise characteristics, for which training audio files containing pure noise would have been required. The methods operate on each speech frame independently, and so they can even handle very short utterances, unlike HEQ where the estimate of $F_{test}(x)$ will be poor for short utterances. The methods are seen to give an advantage when combined with other feature normalisation techniques. Finally, the proposed methods are simple, easy to implement, and are achievable without the use of some commonly used additional tools like a speech/silence detector, a pre-built CDHMM model, etc.

The proposed methods have a few disadvantages. The decomposition is iterative and the step size of the update rules (2.4.5) and (2.4.6) is small. So it takes many iterations to converge, and the number of iterations increases when the size of the database is large. Dictionary estimation from a very large database is computationally expensive and is limited by the availability of memory in the computing device. Iterations have to be performed even during testing, to determine the weights $H_{te}$.

Thus the concept of building-blocks representation of speech is incorporated into the features, still preserving the advantages of using MFCCs. A conventional HMM based recogniser has been used to test the efficacy of the proposed MFCCs against the standard MFCCs.




CHAPTER 4 


Stereo and Non-Stereo Based Feature Compensation 


In this chapter, the techniques based on SPLICE are studied. A simple modification to SPLICE is proposed, based on an assumption made on the correlation of training stereo data. This is shown to give improved performance in all the noise conditions, predominantly in unseen conditions which are highly mismatched with those of training. This method does not need any adaptation data, in contrast to the recent work proposed in literature [Chijiiwa et al., 2012], and has been termed modified SPLICE (M-SPLICE). M-SPLICE has also been extended to work for datasets that are not stereo recorded, with minimal performance degradation as compared to conventional SPLICE. Finally, an MLLR-based run-time noise adaptation framework has been proposed, which is computationally efficient and achieves better results than MLLR HMM-adaptation. This method is done on 13 dimensional MFCCs and does not require two-pass Viterbi decoding, in contrast to conventional MLLR done on 39 dimensions.

4.1. Review of SPLICE 

As discussed in the introduction, the SPLICE algorithm makes the following two assumptions:

(1) The noisy features $\{y\}$ follow a Gaussian mixture density of $M$ modes

(4.1.1)  $p(y) = \sum_{m=1}^{M} P(m)\, p(y \mid m) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(y;\, \mu_{y,m}, \Sigma_{y,m})$

(2) The conditional density $p(x \mid y, m)$ is the Gaussian

(4.1.2)  $p(x \mid y, m) \sim \mathcal{N}(x;\, A_m y + b_m, \Sigma_{x,m})$

where $\{x\}$ are the clean features.

Thus, $A_m$ and $b_m$ parameterise the mixture specific linear transformations on the noisy vector $y$. Here $y$ and $m$ are independent variables, and $x$ is dependent on them. An estimate of the cleaned feature $\hat{x}$ can be obtained in the MMSE framework as shown in Eq. (2.4.1).

The derivation of SPLICE transformations is briefly discussed next. Let $W_m = [\, b_m \;\; A_m \,]$ and $y' = [\, 1 \;\; y^{T} \,]^{T}$. Using $N$ independent pairs of stereo training features $\{(x_n, y_n)\}$ and maximising the joint log-likelihood

(4.1.3)  $\mathcal{L} = \sum_{n=1}^{N} \log p(x_n, y_n) = \sum_{n=1}^{N} \log \sum_{m=1}^{M} p(x_n \mid y_n, m)\, p(y_n \mid m)\, P(m)$

yields

(4.1.4)  $W_m = \left[ \sum_{n=1}^{N} p(m \mid y_n)\, x_n\, y_n'^{T} \right] \left[ \sum_{n=1}^{N} p(m \mid y_n)\, y_n'\, y_n'^{T} \right]^{-1}$

Alternatively, sub-optimal update rules of separately estimating $b_m$ and $A_m$ can be derived by initially assuming $A_m$ to be the identity matrix while estimating $b_m$. The newly estimated $b_m$ is then used to estimate $A_m$.

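A perfect correlation between $x$ and $y$ is assumed, and the following approximation is used in deriving Eq. (4.1.4) [Afify et al., 2009]:

(4.1.5)  $p(m \mid x_n, y_n) \approx p(m \mid x_n) \approx p(m \mid y_n)$

Given the mixture index $m$, Eq. (4.1.4) can be shown to give the MMSE estimator of $\hat{x}_m = A_m y + b_m$ [Deng et al., 2000], given by

(4.1.6)  $\hat{x}_m = \mu_{x,m} + \Sigma_{xy,m} \Sigma_{y,m}^{-1} (y - \mu_{y,m})$

where

(4.1.7)  $\mu_{x,m} = \dfrac{\sum_{n=1}^{N} p(m \mid y_n)\, x_n}{\sum_{n=1}^{N} p(m \mid y_n)}, \qquad \mu_{y,m} = \dfrac{\sum_{n=1}^{N} p(m \mid y_n)\, y_n}{\sum_{n=1}^{N} p(m \mid y_n)}$

(4.1.8)  $\Sigma_{xy,m} = \dfrac{\sum_{n=1}^{N} p(m \mid y_n)\, x_n y_n^{T}}{\sum_{n=1}^{N} p(m \mid y_n)} - \mu_{x,m} \mu_{y,m}^{T}, \qquad \Sigma_{y,m} = \dfrac{\sum_{n=1}^{N} p(m \mid y_n)\, y_n y_n^{T}}{\sum_{n=1}^{N} p(m \mid y_n)} - \mu_{y,m} \mu_{y,m}^{T}$

i.e., the alignments $p(m \mid y_n)$ are being used in place of $p(m \mid x_n)$ and $p(m \mid x_n, y_n)$ in Eqs. (4.1.7) and (4.1.8) respectively. Thus from (4.1.6),

(4.1.9)  $A_m = \Sigma_{xy,m} \Sigma_{y,m}^{-1}$

(4.1.10)  $b_m = \mu_{x,m} - A_m \mu_{y,m}$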
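The estimation of the SPLICE transformations from stereo data can be sketched as below. This is a minimal illustration that takes the alignments $p(m \mid y_n)$ as given and computes weighted means and covariances per mixture; the function name and the synthetic stereo data (an exact affine relation per mixture, with hard alignments) are hypothetical and chosen only so that the result can be checked.

```python
import numpy as np

def splice_transforms(X, Y, gamma):
    """Estimate A_m and b_m from stereo frames.

    X, Y: (N, D) clean and noisy frames; gamma: (N, M) alignments p(m | y_n).
    """
    N, D = X.shape
    M = gamma.shape[1]
    A = np.zeros((M, D, D))
    b = np.zeros((M, D))
    for m in range(M):
        g = gamma[:, m]
        g_sum = g.sum()
        mu_x = g @ X / g_sum                           # weighted clean mean
        mu_y = g @ Y / g_sum                           # weighted noisy mean
        Xc, Yc = X - mu_x, Y - mu_y
        S_xy = (Xc * g[:, None]).T @ Yc / g_sum        # weighted cross-covariance
        S_y = (Yc * g[:, None]).T @ Yc / g_sum         # weighted noisy covariance
        A[m] = S_xy @ np.linalg.inv(S_y)
        b[m] = mu_x - A[m] @ mu_y
    return A, b

# Synthetic check: two mixtures, each with its own exact affine clean/noisy relation.
rng = np.random.default_rng(0)
A_true = np.array([[[1.0, 0.2], [0.0, 0.5]], [[0.3, 0.0], [0.1, 2.0]]])
b_true = np.array([[1.0, -1.0], [0.5, 2.0]])
Y = rng.normal(size=(400, 2))
labels = np.arange(400) % 2
X = np.einsum("nij,nj->ni", A_true[labels], Y) + b_true[labels]
gamma = np.eye(2)[labels]                              # hard alignments for the toy case
A, b = splice_transforms(X, Y, gamma)
```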

To reduce the number of parameters, a simplified model with only the bias $b_m$ is proposed in literature [Deng et al., 2000].

A diagonal version of Eq. (4.1.6) can be written as

(4.1.11)  $\hat{x}_c = \mu_{x,c} + \dfrac{\sigma_{xy,c}}{\sigma_{y,c}^{2}}\, (y - \mu_{y,c})$

where $c$ runs along all components of the features and all mixtures, and $\sigma_{xy,c}$ denotes the covariance between the clean and noisy components. Since this method does not capture all the correlations, it suffers from performance degradation. This shows that noise has a significant effect on feature correlations.
4.2. Proposed Modification to SPLICE

SPLICE assumes that a perfect correlation exists between clean and noisy stereo features (Eq. (4.1.5)), which makes the implementation simple [Afify et al., 2009]. But the actual feature correlations $\Sigma_{xy,m}$ are used to train SPLICE parameters, as seen in Eq. (4.1.9). Instead, if the training process also assumes perfect correlation and eliminates the term $\Sigma_{xy,m}$ during parameter estimation, it complies with the assumptions and gives improved performance. This simple modification can be done as follows.

Eq. (4.1.11) can be rewritten as

$\dfrac{\hat{x} - \mu_x}{\sigma_x} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y} \cdot \dfrac{y - \mu_y}{\sigma_y} = \rho\, \dfrac{y - \mu_y}{\sigma_y}$

where $\rho = \sigma_{xy} / (\sigma_x \sigma_y)$ is the correlation coefficient. A perfect correlation implies $\rho = 1$. Since Eq. (4.1.5) makes this assumption, it can be enforced in the above equation to obtain

(4.2.1)  $\hat{x}_c = \mu_{x,c} + \dfrac{\sigma_{x,c}}{\sigma_{y,c}}\, (y - \mu_{y,c})$

Similarly, for the multidimensional case, the matrix $\Sigma_{x,m}^{-1/2} \Sigma_{xy,m} \Sigma_{y,m}^{-1/2}$ should be identity as per the assumption. Thus, the following is obtained:

(4.2.2)  $\hat{x}_m = \mu_{x,m} + \Sigma_{x,m}^{1/2} \Sigma_{y,m}^{-1/2} (y - \mu_{y,m})$

Hence M-SPLICE and its updates are defined as

(4.2.3)  $\hat{x} = \sum_{m=1}^{M} p(m \mid y)\, (C_m y + d_m)$

(4.2.4)  $C_m = \Sigma_{x,m}^{1/2} \Sigma_{y,m}^{-1/2}$

(4.2.5)  $d_m = \mu_{x,m} - C_m \mu_{y,m}$

All the assumptions of conventional SPLICE are valid for M-SPLICE. Comparing both the methods, it can be seen from Eqs. (4.1.6) and (4.2.4) that while $A_m$ is obtained using the MMSE estimation framework, $C_m$ is based on a whitening expression. Also, $A_m$ involves the cross-covariance term $\Sigma_{xy,m}$, whereas $C_m$ does not. The bias terms are computed in the same manner, using their respective transformation matrices, as seen in Eqs. (4.1.10) and (4.2.5). More analysis on M-SPLICE is given in Section 4.3.1.


4.2.1. Training. The estimation procedure of M-SPLICE transformations is shown in Figure 4.2.1a. The steps are summarised as follows:

(1) Build the noisy GMM $p(y)$ (see footnote 1) using noisy features $\{y_n\}$ of stereo data. This gives $\mu_{y,m}$ and $\Sigma_{y,m}$.

(2) For every noisy frame $y_n$, compute the alignment w.r.t. the noisy GMM, i.e., $p(m \mid y_n)$.

(3) Using the alignments of stereo counterparts, compute the means $\mu_{x,m}$ and covariance matrices $\Sigma_{x,m}$ of each clean mixture from clean data $\{x_n\}$.

(4) Compute $C_m$ and $d_m$ using Eqs. (4.2.4) and (4.2.5).

Footnote 1: The non-standard term noisy mixture has been used to denote a Gaussian mixture built using noisy data. Similar meanings apply to clean mixture, noisy GMM and clean GMM.

Figure 4.2.1. Transform estimation block diagrams of proposed methods: (a) M-SPLICE, using stereo training data; (b) Non-Stereo Method


4.2.2. Testing. The testing process of M-SPLICE is exactly the same as that of conventional SPLICE, and is summarised as follows:

(1) For each test vector $y$, compute the alignment w.r.t. the noisy GMM, i.e., $p(m \mid y)$.

(2) Compute the cleaned version as:

$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\, (C_m y + d_m)$
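The training and testing steps of M-SPLICE can be sketched as follows. This is a minimal illustration under stated assumptions: the noisy GMM is taken as given, the matrix square roots are symmetric square roots obtained by eigendecomposition (other factorisations are possible), and a single-mixture toy case is used so that the effect of the whitening transform can be verified. The function names and toy data are hypothetical.

```python
import numpy as np

def gmm_posteriors(Y, weights, means, covs):
    """Alignments p(m | y_n) of frames Y (N, D) w.r.t. a full-covariance GMM."""
    N, D = Y.shape
    log_p = np.empty((N, len(weights)))
    for m in range(len(weights)):
        diff = Y - means[m]
        _, logdet = np.linalg.slogdet(covs[m])
        maha = np.einsum("nd,nd->n", diff @ np.linalg.inv(covs[m]), diff)
        log_p[:, m] = np.log(weights[m]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(log_p)
    return post / post.sum(axis=1, keepdims=True)

def sym_power(S, power):
    """Symmetric matrix power of a positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** power) @ vecs.T

def train_msplice(X, gamma, means_y, covs_y):
    """Compute C_m and d_m from clean frames X and the noisy-GMM alignments gamma."""
    M, D = means_y.shape
    C = np.zeros((M, D, D))
    d = np.zeros((M, D))
    for m in range(M):
        g = gamma[:, m]
        mu_x = g @ X / g.sum()                         # clean mixture mean
        Xc = X - mu_x
        cov_x = (Xc * g[:, None]).T @ Xc / g.sum()     # clean mixture covariance
        C[m] = sym_power(cov_x, 0.5) @ sym_power(covs_y[m], -0.5)
        d[m] = mu_x - C[m] @ means_y[m]
    return C, d

def apply_msplice(Y, gamma, C, d):
    """Cleaned frames: sum over m of p(m | y) (C_m y + d_m)."""
    per_mixture = np.einsum("mij,nj->nmi", C, Y) + d[None, :, :]   # (N, M, D)
    return np.einsum("nm,nmi->ni", gamma, per_mixture)

rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ mix + 2.0              # toy clean frames
Y = 0.5 * X + rng.normal(size=(500, 3)) - 1.0          # toy noisy stereo counterparts
weights = np.array([1.0])                              # single-mixture "noisy GMM"
means_y = Y.mean(axis=0, keepdims=True)
covs_y = np.cov(Y.T, bias=True)[None]
gamma = gmm_posteriors(Y, weights, means_y, covs_y)
C, d = train_msplice(X, gamma, means_y, covs_y)
X_hat = apply_msplice(Y, gamma, C, d)
```

In this single-mixture case the transform maps the mean and covariance of the noisy frames exactly onto those of the clean frames, without using any cross-covariance between them.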


4.2.3. M-SPLICE with Diagonal Transformations. Techniques such as CMS, HEQ, etc. operate on individual feature dimensions, assuming the features have diagonal covariance structures. This assumption is valid for MFCCs, since the use of the DCT approximately decorrelates the features. Without significant loss of performance, M-SPLICE can also be extended in a similar fashion by constraining the covariance matrices $\Sigma_{x,m}$ and $\Sigma_{y,m}$ to be diagonal. Thus $C_m$ becomes diagonal, and Eq. (4.2.3) can be rewritten as

$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\, (c_m \odot y + d_m)$

where $\mathrm{diag}(c_m) = C_m$ and $\odot$ denotes the element-wise product. This implementation replaces the matrix multiplication in M-SPLICE by scalar product and addition operations.
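The diagonal variant reduces to element-wise operations, as the following sketch shows. It assumes that the alignments and the per-mixture noisy means and variances are available; the function names and the two-mixture toy data with hard alignments are hypothetical.

```python
import numpy as np

def train_msplice_diag(X, gamma, means_y, vars_y):
    """Per-mixture scale c_m = sigma_x / sigma_y and bias d_m, one value per dimension."""
    occ = gamma.sum(axis=0)[:, None]                 # (M, 1) soft counts
    mu_x = gamma.T @ X / occ                         # (M, D) clean mixture means
    var_x = gamma.T @ X ** 2 / occ - mu_x ** 2       # (M, D) clean mixture variances
    c = np.sqrt(var_x / vars_y)
    d = mu_x - c * means_y
    return c, d

def apply_msplice_diag(Y, gamma, c, d):
    """sum_m p(m | y) (c_m * y + d_m), using only element-wise products and additions."""
    return (gamma @ c) * Y + gamma @ d

rng = np.random.default_rng(0)
labels = np.arange(600) % 2
X = rng.normal(size=(600, 4)) * (1.0 + labels[:, None]) + 3.0 * labels[:, None]   # toy clean
Y = 0.7 * X + 0.5 * rng.normal(size=(600, 4)) + 1.0                               # toy noisy
gamma = np.eye(2)[labels]                            # hard alignments for the toy case
occ = gamma.sum(axis=0)[:, None]
means_y = gamma.T @ Y / occ
vars_y = gamma.T @ Y ** 2 / occ - means_y ** 2
c, d = train_msplice_diag(X, gamma, means_y, vars_y)
X_hat = apply_msplice_diag(Y, gamma, c, d)
```

Because the posterior-weighted sum is linear in $c_m$ and $d_m$, the weights can be applied to the scales and biases first, so each frame needs only one element-wise multiply and one add after the alignment is known.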


4.3. Non-Stereo Extension 

This section motivates and proposes the extension of M-SPLICE to datasets which are 
not stereo recorded. However some noisy training utterances, which are not necessarily 
the stereo counterparts of the clean data, are required. 


4.3.1. Motivation. Consider a stereo dataset of $N$ training frames $(x_n, y_n)$. Suppose two $M$ mixture GMMs $p(x)$ and $p(y)$ are independently built using $\{x_n\}$ and $\{y_n\}$ respectively, and each data point is hard-clustered to the mixture giving the highest probability. The matrix $V_{M \times M}$, built as described below, is of interest:

$v_{ij} = \sum_{n=1}^{N} \mathbb{1}(x_n \in i,\; y_n \in j)$

where $\mathbb{1}(\cdot)$ is the indicator function. In other words, while parsing the stereo training data, when a stereo pair with the clean part belonging to the $i$-th clean mixture and the noisy part to the $j$-th noisy mixture is encountered, the $ij$-th element of the matrix is incremented by unity. Thus each $ij$-th element of the matrix denotes the number of stereo pairs belonging to the $i$-th clean, $j$-th noisy mixture-pair. When data are soft assigned to all the mixtures, the matrix can instead be built as:

$v_{ij} = \sum_{n=1}^{N} p(i \mid x_n)\, p(j \mid y_n)$

Figure 4.3.1a visualises such a matrix built using Aurora-2 stereo training data using 128 mixture models. A dark spot in the plot represents a higher data count, i.e., a bulk of stereo data points belong to that mixture-pair.
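Both versions of the mixture assignment matrix can be computed directly from the two sets of alignments, as in the sketch below. The function names and the three-frame example are hypothetical; the posteriors would in practice come from the clean and noisy GMMs.

```python
import numpy as np

def hard_cooccurrence(post_x, post_y):
    """Count stereo pairs whose clean part falls in mixture i and noisy part in mixture j."""
    i = post_x.argmax(axis=1)                  # hard clean assignment per frame
    j = post_y.argmax(axis=1)                  # hard noisy assignment per frame
    V = np.zeros((post_x.shape[1], post_y.shape[1]), dtype=int)
    np.add.at(V, (i, j), 1)                    # increment element (i, j) once per pair
    return V

def soft_cooccurrence(post_x, post_y):
    """Soft version: sum over frames of p(i | x_n) p(j | y_n)."""
    return post_x.T @ post_y

# Three stereo frames, two mixtures in each GMM.
post_x = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])   # p(i | x_n)
post_y = np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]])   # p(j | y_n)
V_hard = hard_cooccurrence(post_x, post_y)
V_soft = soft_cooccurrence(post_x, post_y)
```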

In conventional SPLICE and M-SPLICE, only the noisy GMM $p(y)$ is built, and not $p(x)$. The alignments $p(m \mid y_n)$ are computed for every noisy frame, and the same alignments are assumed for the clean frames $\{x_n\}$ while computing $\mu_{x,m}$ and $\Sigma_{x,m}$. Hence $\mu_{x,m}$, $\Sigma_{x,m}$ and $p(m \mid y)$ can be considered as the parameters of a hypothetical clean GMM $p(x)$. Now, given these GMMs $p(y)$ and $p(x)$, the matrix $V$ can be constructed, which is visualised in Fig. 4.3.1b. Since the alignments are the same, and the $i$-th clean mixture corresponds to the $i$-th noisy mixture, a diagonal pattern can be seen.

Figure 4.3.1. Mixture assignment distribution plots for Aurora-2 stereo training data

Thus, under the assumption of Eq. (4.1.5), conventional SPLICE and M-SPLICE are able to estimate transforms from the $i$-th noisy mixture to exactly the $i$-th clean mixture by maintaining the mixture-correspondence.

When stereo data is not available, such exact mixture correspondence does not exist. Fig. 4.3.1a makes this fact evident, since the stereo property was not used while building the two independent GMMs. However, a sparse structure can be seen, which suggests that for most noisy mixtures $j$, there exists a unique clean mixture $i^{*}$ having the highest mixture-correspondence. This property can be exploited to estimate piecewise linear transformations from every mixture $j$ of $p(y)$ to a single mixture $i^{*}$ of $p(x)$, ignoring all other mixtures $i \neq i^{*}$. This is the basis for the proposed extension to non-stereo data.


4.3.2. Implementation. In the absence of stereo data, the approach is to build two separate GMMs, viz., clean and noisy, during training, such that there exists mixture-to-mixture correspondence between them, as close to Fig. 4.3.1b as possible. Then whitening based transforms can be estimated from each noisy mixture to its corresponding clean mixture. This sort of extension is not obvious in the conventional SPLICE framework, since it is not straightforward to compute the cross-covariance terms $\Sigma_{xy,m}$ without using stereo data. Also, M-SPLICE is expected to work better than SPLICE due to its advantages described earlier.

The training approach of two mixture-corresponded GMMs is as follows:

(1) After building the noisy GMM $p(y)$, it is mean adapted by estimating a global MLLR transformation using clean training data. The transformed GMM has the same covariances and weights, and only the means are altered to match the clean data. By this process, the mixture correspondences are not lost.

(2) However, the transformed GMM need not model the clean data accurately. So a few (typically three) steps of expectation maximisation (EM) are performed using clean training data, initialising with the transformed GMM. This adjusts all the parameters and gives a more accurate representation of the clean GMM $p(x)$.

Now, the matrix obtained through this method using Aurora-2 training data is visualised in Figure 4.3.1c. It can be noted that no stereo information has been used while obtaining $p(x)$, following the above mentioned steps, from $p(y)$. It can be observed that a diagonal pattern is retained, as in the case of M-SPLICE, though there are some outliers. Since stereo information is not used, only comparable performances can be achieved. Figure 4.2.1b shows the block diagram of estimating transformations of the non-stereo method. The steps are summarised as follows:

(1) Build the noisy GMM $p(y)$ using noisy features $\{y\}$. This gives $\mu_{y,m}$ and $\Sigma_{y,m}$.

(2) Adapt the means of the noisy GMM $p(y)$ to clean data $\{x\}$ using a global MLLR transformation.

(3) Perform at least three EM iterations to refine the adapted GMM using clean data. This gives $p(x)$, and thus $\mu_{x,m}$ and $\Sigma_{x,m}$.

(4) Compute $C_m$ and $d_m$ using Eqs. (4.2.4) and (4.2.5).

The testing process is exactly the same as that of M-SPLICE, as explained in Section 4.2.2.
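The four training steps of the non-stereo method can be sketched as below. This is an illustration under stated assumptions, not the implementation used in the experiments: the GMMs have diagonal covariances, the global MLLR mean transform is estimated in a single pass using alignments from the unadapted GMM, and the closed-form row-wise solution for diagonal covariances is used. All function names and the synthetic four-mixture data are hypothetical.

```python
import numpy as np

def posteriors(Z, weights, means, variances):
    """Alignments p(m | z_n) w.r.t. a diagonal-covariance GMM; returns (N, M)."""
    diff = Z[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    return post / post.sum(axis=1, keepdims=True)

def global_mllr_means(Z, gamma, means, variances):
    """Adapt all means with one affine transform W [1, mu]^T (diagonal covariances)."""
    M, D = means.shape
    xi = np.hstack([np.ones((M, 1)), means])         # extended means, (M, D+1)
    occ = gamma.sum(axis=0)                          # soft counts per mixture
    first = gamma.T @ Z                              # sum_n gamma_nm z_n, (M, D)
    W = np.zeros((D, D + 1))
    for i in range(D):                               # each row of W is solved separately
        inv_var = 1.0 / variances[:, i]
        G = (xi * (occ * inv_var)[:, None]).T @ xi
        k = (first[:, i] * inv_var) @ xi
        W[i] = np.linalg.solve(G, k)
    return xi @ W.T                                  # adapted means, (M, D)

def em_step(Z, weights, means, variances):
    """One EM iteration of a diagonal-covariance GMM."""
    gamma = posteriors(Z, weights, means, variances)
    occ = gamma.sum(axis=0)
    new_means = gamma.T @ Z / occ[:, None]
    new_vars = gamma.T @ Z ** 2 / occ[:, None] - new_means ** 2
    return occ / occ.sum(), new_means, np.maximum(new_vars, 1e-6)

def clean_gmm_from_noisy(X, weights_y, means_y, vars_y, n_em=3):
    """Steps (2) and (3): MLLR mean adaptation to clean data, then a few EM iterations."""
    gamma = posteriors(X, weights_y, means_y, vars_y)
    means = global_mllr_means(X, gamma, means_y, vars_y)
    w, v = weights_y, vars_y
    for _ in range(n_em):
        w, means, v = em_step(X, w, means, v)
    return w, means, v

def whitening_transforms(means_x, vars_x, means_y, vars_y):
    """Step (4) for diagonal covariances: c_m = sigma_x / sigma_y, d_m = mu_x - c_m mu_y."""
    c = np.sqrt(vars_x / vars_y)
    return c, means_x - c * means_y

rng = np.random.default_rng(0)
means_y = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])   # toy noisy GMM
vars_y = np.full((4, 2), 0.1)
weights_y = np.full(4, 0.25)
true_means_x = 1.2 * means_y + np.array([1.0, -1.0])   # toy clean condition
labels = rng.integers(0, 4, size=2000)                  # clean frames, not stereo counterparts
X = true_means_x[labels] + np.sqrt(0.1) * rng.normal(size=(2000, 2))
weights_x, means_x, vars_x = clean_gmm_from_noisy(X, weights_y, means_y, vars_y)
c, d = whitening_transforms(means_x, vars_x, means_y, vars_y)
```

Because the clean GMM is initialised from the adapted noisy GMM, mixture $m$ of the clean model stays paired with mixture $m$ of the noisy model, which is what the transforms rely on.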


4.4. Additional Run-time Adaptation 

To improve the performance of the proposed methods during run-time, GMM adaptation to the test condition can be done in both conventional SPLICE and M-SPLICE frameworks in a simple manner. Conventional MLLR adaptation on HMMs involves two-pass recognition, where the transformation matrices are estimated using the alignments obtained through the first pass Viterbi-decoded output, and a final recognition is performed using the transformed models.

MLLR adaptation can be used to adapt GMMs in the context of SPLICE and M-SPLICE as follows:

(1) Adapt the noisy GMM through a global MLLR mean transformation, which changes each mean $\mu_{y,m}$ to $\mu_{y,m}^{(a)}$.

(2) Now, adjust the bias term in conventional SPLICE or M-SPLICE as

(4.4.1)  $d_m^{(a)} = \mu_{x,m} - C_m\, \mu_{y,m}^{(a)}$

(with $A_m$ and $b_m$ in place of $C_m$ and $d_m$ for conventional SPLICE).

This method involves only a simple calculation of alignments of the test data w.r.t. the noisy GMM, and does not need Viterbi decoding. The clean mixture means $\mu_{x,m}$ computed during training need to be stored. A separate global MLLR mean transform can be estimated using test utterances belonging to each noise condition. The steps of the testing process for run-time compensation are summarised as follows:

(1) For all test vectors $\{y\}$ belonging to a particular environment, compute the alignments w.r.t. the noisy GMM, i.e., $p(m \mid y)$.

(2) Estimate a global MLLR mean transformation using $\{y\}$, maximising the likelihood w.r.t. $p(y)$.

(3) Compute the adapted noisy GMM $p^{(a)}(y)$ using the estimated MLLR transform. Only the means $\mu_{y,m}$ of the noisy GMM would have been adapted as $\mu_{y,m}^{(a)}$.

(4) Using Eq. (4.4.1), recompute the bias term of SPLICE or M-SPLICE.

(5) Compute the cleaned test vectors as

$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\, (C_m y + d_m^{(a)})$
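The five run-time steps can be sketched as follows for the diagonal M-SPLICE case. This is an illustration under stated assumptions: diagonal covariances, a single MLLR pass with alignments from the unadapted noisy GMM, and the closed-form row-wise solution of the global mean transform. The function names and the synthetic test condition (a constant shift of the noisy features) are hypothetical.

```python
import numpy as np

def posteriors(Y, weights, means, variances):
    """Alignments p(m | y_n) w.r.t. a diagonal-covariance GMM; returns (N, M)."""
    diff = Y[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    return post / post.sum(axis=1, keepdims=True)

def global_mllr_means(Y, gamma, means, variances):
    """Adapt all means with one affine transform W [1, mu]^T (diagonal covariances)."""
    M, D = means.shape
    xi = np.hstack([np.ones((M, 1)), means])         # extended means, (M, D+1)
    occ = gamma.sum(axis=0)
    first = gamma.T @ Y
    W = np.zeros((D, D + 1))
    for i in range(D):
        inv_var = 1.0 / variances[:, i]
        G = (xi * (occ * inv_var)[:, None]).T @ xi
        k = (first[:, i] * inv_var) @ xi
        W[i] = np.linalg.solve(G, k)
    return xi @ W.T

def adapt_and_clean(Y, weights, means_y, vars_y, means_x, c):
    """Run-time compensation with diagonal M-SPLICE scales c (M, D)."""
    gamma = posteriors(Y, weights, means_y, vars_y)            # step (1)
    means_y_a = global_mllr_means(Y, gamma, means_y, vars_y)   # steps (2) and (3)
    d_a = means_x - c * means_y_a                              # step (4), adjusted bias
    return (gamma @ c) * Y + gamma @ d_a                       # step (5)

rng = np.random.default_rng(0)
means_y = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])   # toy noisy GMM
vars_y = np.full((4, 2), 0.25)
weights = np.full(4, 0.25)
means_x = 0.5 * means_y + 1.0            # stored clean mixture means (toy values)
c = np.ones((4, 2))                      # toy diagonal scales
labels = np.arange(800) % 4
shift = np.array([0.6, -0.4])            # unseen test condition: constant offset
Y_test = means_y[labels] + 0.5 * rng.normal(size=(800, 2)) + shift

X_adapted = adapt_and_clean(Y_test, weights, means_y, vars_y, means_x, c)
gamma = posteriors(Y_test, weights, means_y, vars_y)
X_plain = (gamma @ c) * Y_test + gamma @ (means_x - c * means_y)   # without adaptation
```

Only the alignments and one small linear solve per feature dimension are needed, so no decoding pass is involved. In the toy condition the unadapted output retains the offset, while the adapted bias removes it.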


4.5. Experiments and Results 

4.5.1. Experimental Setup. All SPLICE based linear transformations have been applied on 13 dimensional MFCCs, including $C_0$. The Aurora-2 setup is the same as described in Section 3.6.1. During HMM training, the features are appended with 13 delta and 13 acceleration coefficients to get a composite 39 dimensional vector per frame. Cepstral mean subtraction (CMS) has been performed in all the experiments. 128 mixture GMMs are built for all SPLICE based experiments. Run-time noise adaptation in the SPLICE framework is performed on 13 dimensional MFCCs. Data belonging to each SNR level of a test noise condition has been separately used to compute the global transformations. In all SPLICE based experiments, pseudo-cleaning of clean features has been performed.

To test the efficacy of the non-stereo method on a database which does not contain stereo data, the Aurora-4 task of 8 kHz sampling frequency has been used. Aurora-4 is a continuous speech recognition task with clean and noisy training utterances (non-stereo) and test utterances of 14 environments. Aurora-4 acoustic models are built using crossword triphone HMMs of 3 states and 6 mixtures per state. The standard WSJ0 bigram language model has been used during decoding of Aurora-4. A noisy GMM of 512 mixtures is built for evaluating the non-stereo method, using 7138 utterances taken from both clean and multi-training data. This GMM is adapted to the standard clean training set to get the clean GMM.


4.5.2. Results. Table 1 summarises the results of the various algorithms discussed, on the Aurora-2 dataset. All the results are shown in % accuracy. All SNR levels mentioned are in decibels. The first seven rows report the overall results on all 10 test noise conditions. The rest of the rows report the average values in the SNR range 20 to 0 dB. Table 1b shows the results of run-time adaptation (indicated as RA) using various methods. For reference, the result of standard MLLR adaptation on HMMs [Gales, 1998] has been shown in Table 1b, which computes a global 39 dimensional mean transformation, and uses two-pass Viterbi decoding. Table 2 shows the experimental results on the Aurora-4 database. Table 2a shows the results of the non-stereo method on the Aurora-4 database using clean-trained HMMs. Table 2b shows the similar results for multi-trained HMMs, using the standard multi-training dataset.

It can be seen that M-SPLICE improves over SPLICE at all noise conditions and SNR levels and gives an absolute improvement of 8.6% in test-set C and 2.93% overall. Run-time compensation in the SPLICE framework gives improvements over standard MLLR in test-sets A and B, whereas M-SPLICE gives improvements in all conditions. Here 9.89% absolute improvement can be observed over SPLICE with run-time noise adaptation, and 4.96% over standard MLLR. Finally, the non-stereo method, though not using stereo data, shows 10.37% and 7.05% absolute improvements over the Aurora-2 and Aurora-4 clean baseline models respectively, and a slight degradation w.r.t. SPLICE in all test cases. Run-time noise adaptation results of the non-stereo method are comparable to those of standard MLLR, and are computationally less expensive. It can be observed that the non-stereo method gives performance similar to that of multi-condition training.

Table 1. SPLICE-based methods on Aurora-2 database

(a) Individual Methods

Noise Level   Baseline   SPLICE   M-SPLICE   M-SPLICE Diagonal   Non-Stereo Method   Non-Stereo Method Diagonal
Clean         99.25      98.97    99.01      98.98               99.08               99.03
SNR 20        97.35      97.84    97.92      97.85               97.68               97.67
SNR 15        93.43      95.81    96.10      95.78               95.15               95.01
SNR 10        80.62      89.48    91.03      90.19               87.37               86.74
SNR 5         51.87      72.71    77.59      75.46               68.49               67.35
SNR 0         24.30      42.85    50.72      46.60               39.00               37.76
SNR -5        12.03      18.52    22.27      19.39               16.73               16.31
Test A        67.45      81.39    83.47      81.72               77.44               76.64
Test B        72.26      83.24    84.18      82.43               79.63               79.06
Test C        68.14      69.42    78.06      77.57               73.54               73.13
Overall       69.51      79.74    82.67      81.17               77.54               76.91

(b) Run-time adaptation

Noise Level   MLLR (39)   SPLICE + RA   M-SPLICE + RA   M-SPLICE Diagonal + RA   Non-Stereo Method + RA   Non-Stereo Method Diagonal + RA
Clean         99.28       99.05         99.02           99.00                    99.08                    99.07
SNR 20        98.33       97.96         98.18           98.22                    97.77                    97.75
SNR 15        96.82       96.21         96.87           96.70                    95.47                    95.38
SNR 10        91.88       90.61         93.10           92.61                    88.80                    88.77
SNR 5         73.88       75.05         82.00           81.05                    72.36                    72.34
SNR 0         41.94       46.27         57.51           55.70                    44.98                    44.84
SNR -5        18.71       20.10         27.32           26.30                    20.43                    20.27
Test A        79.31       82.45         86.47           85.90                    80.12                    80.01
Test B        82.55       84.09         85.91           85.05                    81.67                    81.53
Test C        79.14       73.01         82.90           82.37                    75.79                    75.99
Overall       80.57       81.22         85.53           84.85                    79.88                    79.82
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Table 2. Non-Stereo method on Aurora-4 database 


(a) Clean-Training

                          Clean  Car    Babble  Street  Restaurant  Airport  Station  Average
Baseline           Mic-1  87.63  75.58  52.77   52.83   46.53       56.38    45.30    54.73
                   Mic-2  77.40  64.39  45.15   42.03   36.26       47.69    36.32
Non-Stereo Method  Mic-1  87.39  77.21  62.79   59.29   57.43       60.9     57.99    61.78
                   Mic-2  78.95  67.18  55.41   50.72   46.31       54.55    48.76

(b) Multi-Training

                          Clean  Car    Babble  Street  Restaurant  Airport  Station  Average
Baseline           Mic-1  86.89  82.83  70.52   68.52   65.37       72.73    64.28    70.16
                   Mic-2  82.29  77.36  65.48   62.23   57.61       67.29    58.77
Non-Stereo Method  Mic-1  86.21  82.03  71.64   67.59   66.47       71.32    65.89    69.98
                   Mic-2  80.59  75.68  66.41   62.11   58.68       65.31    59.85

4.6. Discussion 

In terms of computational cost, M-SPLICE and the non-stereo method are identical to
conventional SPLICE during testing. The increase in cost during training is also almost
negligible. The MLLR mean adaptation used in both the non-stereo method and run-time
adaptation is computationally very efficient, and does not need Viterbi decoding. The
diagonal versions of the proposed methods give comparable performance.

In terms of performance, M-SPLICE achieves good results in all cases without any use of
adaptation data, especially in unseen cases. In the non-stereo method, a one-to-one
mixture correspondence is assumed between the noisy and clean GMMs. The method gives a
slight degradation in performance, which could be attributed to neglecting the outlier data.

Compared with other existing feature normalisation techniques, the techniques in the
SPLICE framework operate on individual feature vectors, and no parameters need to be
estimated from test data. These methods therefore do not suffer from test data
insufficiency problems, and are advantageous for shorter utterances. The testing process
is also usually faster, and is easily implementable in real-time applications. By
extending the methods to non-stereo data, we believe that they become more useful in many
applications.
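
As a rough illustration of the frame-by-frame operation described above, the following
sketch applies the usual SPLICE-style minimum mean square error estimate,
x_hat = sum_m p(m | y) W_m y', to a single noisy feature vector. The function names, the
diagonal-covariance GMM on noisy features, and the array shapes are assumptions made for
illustration; they are not taken from the implementation used in this work.

```python
import numpy as np


def gmm_posteriors(y, weights, means, variances):
    """Return p(m | y) for a diagonal-covariance GMM (assumed form); y has shape (D,)."""
    diff = y - means                                        # (M, D)
    log_lik = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2.0 * np.pi * variances), axis=1)
    log_post = np.log(weights) + log_lik
    log_post -= np.max(log_post)                            # avoid underflow in exp
    post = np.exp(log_post)
    return post / post.sum()


def splice_compensate(y, weights, means, variances, W):
    """Estimate the clean vector from one noisy vector y.

    W has shape (M, D, D + 1) with W[m] = [b_m A_m] (hypothetical layout).
    """
    post = gmm_posteriors(y, weights, means, variances)     # (M,)
    y_ext = np.concatenate(([1.0], y))                      # y' = [1, y^T]^T
    per_mixture = W @ y_ext                                 # (M, D): W_m y' for each m
    return post @ per_mixture                               # posterior-weighted sum
```

Every quantity here is computed from the current frame alone, so nothing has to be
estimated from the test utterance itself.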


CHAPTER 5 


Conclusion and Future Work 


In this thesis, feature normalisation methods suitable for noise-robust real-time ASR
have been studied. When no information about the noise is available, it has been shown
that an additional feature processing block can be included in the MFCC extraction
process for noise robustness. The additional block rebuilds each speech frame as a
non-negative linear combination of the speech subspace basis vectors in the LMFB domain.
The new features are also shown to give improved performance when used with existing
techniques such as HEQ and HLDA.

In future, the methods may be improved by imposing sparseness constraints to learn
better and more meaningful speech dictionaries. Also, the current methods involve
iterative estimation of the weights H during testing. Though the implementation is
simple, its slow convergence means that a large number of iterations is required.
Addressing this issue is another possible direction for future work. The methods may also
be implemented in higher-dimensional spaces for further improvement, possibly by using a
DFT over a larger number of samples during feature extraction.
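
The iterative weight estimation mentioned above can be sketched as follows. This is only
a minimal illustration that assumes the standard multiplicative update for the Euclidean
NMF cost with a fixed, pre-learned basis W and non-negative input features V; the cost
function actually used in this work may differ, and the function name and default values
are hypothetical.

```python
import numpy as np


def estimate_weights(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate non-negative weights H such that V is approximately W @ H.

    V: (F, T) non-negative features, one frame per column (assumed layout).
    W: (F, K) fixed non-negative basis learned beforehand on clean speech.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps   # positive initialisation
    WtV = W.T @ V                                    # constant across iterations
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update: keeps H non-negative at every step.
        H *= WtV / (WtW @ H + eps)
    return H
```

Each iteration costs only a few small matrix products, but many iterations may be needed
before H settles, which is the slow convergence noted above.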

In the presence of stereo training data, a modified version of the SPLICE algorithm
has been proposed for noise-robust ASR. It complies better with the assumptions of
SPLICE, and improves recognition in highly mismatched and unseen noise conditions.
An extension of the methods to non-stereo data has been presented. Finally, a convenient
run-time adaptation framework has been explained, which is computationally much
cheaper than the standard MLLR adaptation of HMMs.

In future, the efficiency of the non-stereo extension of SPLICE can be improved. Better
techniques could be proposed to achieve this, which either give fewer outliers in their
mixture distribution plot, or do not neglect the outlier data. M-SPLICE could also be
extended to the uncertainty decoding framework, which has gained popularity
[Droppo et al., 2002] over the conventional SPLICE.

Derivation of SPLICE Parameters


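Let $W_m = [b_m \;\; A_m]$ and $y' = [1 \;\; y^T]^T$. The pair $(x_n, y_n)$ can be substituted in

$$p(x \mid y, m) \sim \mathcal{N}(x;\, A_m y + b_m,\, \Sigma_{x,m})$$

to obtain

$$p(x_n \mid y_n, m) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_{x,m}|^{1/2}} \exp\left(-\tfrac{1}{2}\,(x_n - W_m y'_n)^T\, \Sigma_{x,m}^{-1}\, (x_n - W_m y'_n)\right)$$

Now, the log-likelihood function can be expanded as

$$\mathcal{L} = \log\left[\prod_{n=1}^{N} p(x_n, y_n)\right] = \sum_{n=1}^{N} \log\left[\sum_{m=1}^{M} p(x_n, y_n, m)\right]$$

$$\mathcal{L} = \sum_{n=1}^{N} \log\left[\sum_{m=1}^{M} p(x_n \mid y_n, m)\, p(y_n \mid m)\, P(m)\right]$$

Differentiating w.r.t. $W_m$ and equating to zero, and using the matrix identity

$$\frac{\partial}{\partial A}\,(x - As)^T B\,(x - As) = -2B\,(x - As)\,s^T$$

gives

$$\frac{\partial \mathcal{L}}{\partial W_m} = \sum_{n=1}^{N} \frac{1}{p(x_n, y_n)}\; p(y_n \mid m)\, P(m)\, p(x_n \mid y_n, m)\; \Sigma_{x,m}^{-1}\, (x_n - W_m y'_n)\, {y'_n}^{T} = 0$$

$$\Sigma_{x,m}^{-1} \left(\sum_{n=1}^{N} p(m \mid x_n, y_n)\, (x_n - W_m y'_n)\, {y'_n}^{T}\right) = 0$$

SPLICE assumes a perfect correlation between $x$ and $y$, so

$$p(m \mid x_n, y_n) \approx p(m \mid x_n) \approx p(m \mid y_n)$$

Since $\Sigma_{x,m}^{-1}$ is non-singular and the above matrix product is zero, it can be proved that

$$\sum_{n=1}^{N} p(m \mid y_n)\, (x_n - W_m y'_n)\, {y'_n}^{T} = 0$$

Solving for $W_m$ yields

$$W_m = \left[\sum_{n=1}^{N} p(m \mid y_n)\, x_n\, {y'_n}^{T}\right] \left[\sum_{n=1}^{N} p(m \mid y_n)\, y'_n\, {y'_n}^{T}\right]^{-1}$$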
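
The closed-form solution for W_m can be checked numerically with a small sketch. The
function name and array layout below are assumptions for illustration, and the
posteriors p(m | y_n) are taken as given (for example, from a GMM trained on noisy
features).

```python
import numpy as np


def estimate_splice_transforms(X, Y, post):
    """Posterior-weighted least-squares estimate of W_m = [b_m A_m].

    X: (N, D) clean vectors, Y: (N, D) noisy vectors (stereo pairs, row n <-> row n).
    post: (N, M) with post[n, m] = p(m | y_n).
    Returns W with shape (M, D, D + 1).
    """
    N, D = X.shape
    M = post.shape[1]
    Y_ext = np.hstack([np.ones((N, 1)), Y])          # row n is y'_n^T = [1, y_n^T]
    W = np.empty((M, D, D + 1))
    for m in range(M):
        g = post[:, m][:, None]                      # (N, 1) posterior weights
        num = (g * X).T @ Y_ext                      # sum_n p(m|y_n) x_n y'_n^T
        den = (g * Y_ext).T @ Y_ext                  # sum_n p(m|y_n) y'_n y'_n^T
        # W_m = num @ inv(den); solve a linear system instead of inverting.
        W[m] = np.linalg.solve(den.T, num.T).T
    return W
```

With a single mixture and noise-free affine data, the estimate should recover the
underlying bias and matrix, which gives a simple sanity check of the formula.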
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